Fast Algorithms for Learning with Long N-grams via Suffix Tree Based Matrix Multiplication

Authors

  • Hristo S. Paskov
  • John C. Mitchell
  • Trevor J. Hastie
Abstract

This matrix format is inefficient for storing frequency data, since all entries in x are known to be non-negative integers. Moreover, the number of bits needed to store each index in the jc array is ⌈log2 nz⌉, which can be significantly larger than ⌈log2 U⌉, where U is the largest number of non-zero elements in any column. Our modified CSC format simply replaces the jc array with an integer array of size N that stores the number of non-zero elements in each column, and it replaces x with an integer array of frequency counts. These modifications can lead to substantial savings when appropriate.
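The layout change described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a standard CSC triple (ir, jc, x) with jc of length N + 1, reads nz as the total number of non-zero entries, and uses hypothetical function names (csc_to_counts, matvec, index_bits) chosen only for this example.

```python
from math import ceil, log2

def csc_to_counts(ir, jc, x):
    """Convert standard CSC arrays to the count-based variant.

    ir: row index of each non-zero; jc: column pointers (length N + 1);
    x: non-zero values. Returns (ir, counts, freq), where counts[j] is the
    number of non-zeros in column j and freq holds integer frequencies.
    """
    counts = [jc[j + 1] - jc[j] for j in range(len(jc) - 1)]
    freq = [int(v) for v in x]  # assumes every entry is a non-negative integer
    return ir, counts, freq

def matvec(num_rows, ir, counts, freq, w):
    """Compute y = X w by walking the columns in order."""
    y = [0] * num_rows
    pos = 0  # running offset replaces the lookup into jc
    for j, c in enumerate(counts):
        for k in range(pos, pos + c):
            y[ir[k]] += freq[k] * w[j]
        pos += c
    return y

def index_bits(jc):
    """Bits per entry: ceil(log2 nz) for jc versus ceil(log2 U) for counts."""
    nz = jc[-1]
    u = max(jc[j + 1] - jc[j] for j in range(len(jc) - 1))
    return max(1, ceil(log2(nz))), max(1, ceil(log2(u)))

# A 3 x 3 example matrix with columns [2,0,1], [0,3,0], [1,1,4].
ir = [0, 2, 1, 0, 1, 2]
jc = [0, 2, 3, 6]
x = [2.0, 1.0, 3.0, 1.0, 1.0, 4.0]

ir2, counts, freq = csc_to_counts(ir, jc, x)
y = matvec(3, ir2, counts, freq, [1, 2, 3])
bits = index_bits(jc)
```

Because the counts array stores per-column sizes rather than absolute offsets, a column's start position is only available through a running sum, so this layout suits sequential passes over all columns (as in a matrix-vector product) better than random access to a single column.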


Similar resources

On-Line Cumulative Learning of Hierarchical Sparse n-grams

We present a system for on-line, cumulative learning of hierarchical collections of frequent patterns from unsegmented data streams. Such learning is critical for long-lived intelligent agents in complex worlds. Learned patterns enable prediction of unseen data and serve as building blocks for higher-level knowledge representation. We introduce a novel sparse n-gram model that, unlike pruned n-...


Substring Count Estimation in Extremely Long Strings

To estimate the number of substring matches against string data, count suffix trees (CS-tree) have been used as a kind of alphanumeric histograms. Although the trees are useful for substring count estimation in short data strings (e.g. name or title), they reveal several drawbacks when the target is changed to extremely long strings. First, it becomes too hard or at least slow to build CS-trees...


Compact Suffix Trees Resemble PATRICIA Tries: Limiting Distribution of the Depth

Suffix trees are the most frequently used data structures in algorithms on words. In this paper, we consider the depth of a compact suffix tree, also known as the PAT tree, under some simple probabilistic assumptions. For a biased memoryless source, we prove that the limiting distribution for the depth in a PAT tree is the same as the limiting distribution for the depth in a PATRICIA trie, even...


Generating Random Spanning Trees via Fast Matrix Multiplication

We consider the problem of sampling a uniformly random spanning tree of a graph. This is a classic algorithmic problem for which several exact and approximate algorithms are known. Random spanning trees have several connections to Laplacian matrices; this leads to algorithms based on fast matrix multiplication. The best algorithm for dense graphs can produce a uniformly random spanning tree of ...


Modeling Algorithm Performance on Highly-threaded Many-core Architectures

Chapter 1: Introduction. 1.1 Examples of Highly-threaded Many-core Architectures. 1.2 Research Questions. 1.3 Methodology for Performance Modeling ...




Publication date: 2015